Linear classifiers cannot handle non-linear decision boundaries,

such as two classes lying on concentric circles (same center, different radii).

-> Use polar coordinates to build a feature space, with $r$ on the x-axis and $\theta$ on the y-axis.

All points on the same circle then lie on the same vertical line $r = \text{const}$, parallel to the y-axis, so a linear boundary can separate the two circles.
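
A minimal sketch of this feature transform, assuming two circles of radius 1 and 2 centered at the origin (the radii, sample counts, threshold, and helper names `circle` and `to_polar` are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def circle(radius, n):
    """Sample n points on a circle of the given radius centered at the origin."""
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)

def to_polar(points):
    """Map Cartesian (x, y) points to the feature space (r, theta)."""
    r = np.hypot(points[:, 0], points[:, 1])
    theta = np.arctan2(points[:, 1], points[:, 0])
    return np.stack([r, theta], axis=1)

inner = to_polar(circle(1.0, 100))   # class 0 (assumed radius 1)
outer = to_polar(circle(2.0, 100))   # class 1 (assumed radius 2)

# In (r, theta) space each circle collapses onto a vertical line r = const,
# so the linear rule "r > threshold" separates the two classes.
threshold = 1.5
print((inner[:, 0] < threshold).all(), (outer[:, 0] > threshold).all())
```

No line in the original (x, y) plane separates these two classes, but a single threshold on $r$ does.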

(Before) Feature extraction + linear score function: only the small classifier part at the end of the pipeline can adjust itself to maximize performance.

It learns only one template per category.

(After) Neural network: raw image pixels -> classification scores. The whole pipeline is tuned.

It can learn several templates per category.

Linear score function: $f = Wx$

2-layer Neural Network: $f = W_2\max(0, W_1x)$

$W_2 \in \mathbb{R}^{C\times H}, \: W_1 \in \mathbb{R}^{H\times D}, \: x \in \mathbb{R}^D$

$h = W_1x = (\alpha_1, \alpha_2, \cdots, \alpha_H)^Tx$, where $\alpha_i^T$ is the $i$-th row of $W_1$

Element $(i,j)$ of $W_1$ gives the effect on $h_i$ from $x_j$
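
The shapes above can be checked with a short forward pass. This is only a sketch: the sizes `C`, `H`, `D` and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, D = 10, 100, 3072            # illustrative sizes, not from the notes

x = rng.standard_normal(D)         # input vector
W1 = rng.standard_normal((H, D))   # first-layer weights
W2 = rng.standard_normal((C, H))   # second-layer weights

h = W1 @ x                         # hidden pre-activations, shape (H,)
f = W2 @ np.maximum(0.0, h)        # class scores, shape (C,)

# h_i is the dot product of row i of W1 with x,
# so W1[i, j] is the weight of x_j in h_i.
i = 5
print(np.isclose(h[i], W1[i] @ x))
print(h.shape, f.shape)
```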

Deep Neural Networks: Depth = number of layers = number of weight matrices

Width = size of each layer

Activation Functions:

Without the activation function, the network collapses back to $f = W_2W_1x = Wx$ (with $W = W_2W_1$), which is just a linear classifier.
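
A small numerical check of this collapse, using hand-picked matrices chosen only for illustration:

```python
import numpy as np

W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])   # shape (H, D) = (2, 2), illustrative values
W2 = np.array([[1.0, 1.0]])    # shape (C, H) = (1, 2), illustrative values
x = np.array([2.0, 0.0])

W = W2 @ W1                                 # single equivalent matrix
no_activation = W2 @ (W1 @ x)               # two stacked linear layers
with_relu = W2 @ np.maximum(0.0, W1 @ x)    # ReLU between the layers

print(np.allclose(no_activation, W @ x))    # the two linear layers equal one matrix W
print(no_activation, with_relu)             # the ReLU version gives a different result
```

With these particular weights $W_2W_1$ is the zero matrix, so the purely linear network always outputs 0, while the ReLU network computes $|x_1 - x_2|$, which no single matrix $W$ can represent.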

| Activation function | Expression | Graph |
| --- | --- | --- |
| Sigmoid | $\sigma(x)=\frac{1}{1+e^{-x}}$ | [Figure: sigmoid function] |
| tanh | $\tanh(x)$ | [Figure: tanh] |
| ReLU (a good default choice for most problems) | $\max(0,x)$ | [Figure: ReLU] |
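
The three activation functions can be written in NumPy as follows. The function names are chosen here for illustration.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: squashes inputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent: squashes inputs into (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))
print(tanh(x))
print(relu(x))

# tanh is a rescaled and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
print(np.allclose(tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))
```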

A simple implementation:

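import numpy as np
from numpy.random import randn

N, Din, H, Dout = 64, 1000, 100, 10       # batch size, input dim, hidden dim, output dim
x, y = randn(N, Din), randn(N, Dout)      # random inputs and targets
w1, w2 = randn(Din, H), randn(H, Dout)    # randomly initialized weights
for t in range(10000):
    # Forward pass: sigmoid hidden layer, linear output, squared-error loss
    h = 1.0 / (1.0 + np.exp(-x.dot(w1)))
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum()
    # Backward pass: gradients of the loss with respect to w2 and w1
    dy_pred = 2.0 * (y_pred - y)
    dw2 = h.T.dot(dy_pred)
    dh = dy_pred.dot(w2.T)
    dw1 = x.T.dot(dh * h * (1 - h))       # sigmoid derivative is h * (1 - h)
    # Gradient descent step with learning rate 1e-4
    w1 -= 1e-4 * dw1
    w2 -= 1e-4 * dw2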
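
One way to gain confidence in the backward pass above is a numerical gradient check. The sketch below reuses the same forward and backward formulas on much smaller, illustrative sizes; the helpers `loss_fn` and `numerical_grad` are hypothetical names introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Din, H, Dout = 4, 5, 3, 2    # small illustrative sizes
x, y = rng.standard_normal((N, Din)), rng.standard_normal((N, Dout))
w1, w2 = rng.standard_normal((Din, H)), rng.standard_normal((H, Dout))

def loss_fn(w1, w2):
    """Same forward pass as in the notes: sigmoid hidden layer, squared error."""
    h = 1.0 / (1.0 + np.exp(-x.dot(w1)))
    return np.square(h.dot(w2) - y).sum()

def numerical_grad(f, w, eps=1e-6):
    """Central-difference estimate of df/dw, one entry at a time."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[idx] += eps
        w_minus[idx] -= eps
        grad[idx] = (f(w_plus) - f(w_minus)) / (2.0 * eps)
    return grad

# Analytic gradients, using the same formulas as the training loop
h = 1.0 / (1.0 + np.exp(-x.dot(w1)))
dy_pred = 2.0 * (h.dot(w2) - y)
dw2 = h.T.dot(dy_pred)
dw1 = x.T.dot(dy_pred.dot(w2.T) * h * (1 - h))

print(np.allclose(dw1, numerical_grad(lambda w: loss_fn(w, w2), w1), atol=1e-5))
print(np.allclose(dw2, numerical_grad(lambda w: loss_fn(w1, w), w2), atol=1e-5))
```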

Space warping:

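A linear transform alone does not help: points that are not linearly separable in the input space are still not linearly separable in the transformed feature space.

With a ReLU applied after the linear transform, the space is warped (regions with negative coordinates are collapsed onto the axes), and the points can become linearly separable in the new feature space.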
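
A tiny one-dimensional sketch of this effect, with hand-picked data and weights that are purely illustrative. The points $x = -1, 0, 1$ with labels 1, 0, 1 cannot be split by any single threshold on $x$, but after a ReLU layer they can.

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0])
labels = np.array([1, 0, 1])        # outer points are class 1, the middle point is class 0

W1 = np.array([[1.0], [-1.0]])      # hidden-layer weights, shape (2, 1)
h = np.maximum(0.0, W1 @ x[None, :])    # shape (2, 3): rows are max(0, x) and max(0, -x)

# A linear classifier on the warped features: score = h_1 + h_2 - 0.5 = |x| - 0.5
w, b = np.array([1.0, 1.0]), -0.5
scores = w @ h + b
pred = (scores > 0).astype(int)

print(scores)
print(np.array_equal(pred, labels))
```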

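Universal Approximation:

Use the layer biases to shift each ReLU along the x-axis.

Sum many shifted ReLUs to approximate the target function.

To bring the output back to 0, or to keep it constant, add a ReLU whose slope is the opposite of the current slope.

Keep the coefficient of $x$ inside each $\max$ equal to 1 and adjust only the factor that multiplies the $\max$ (together with the bias that shifts it).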
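
The construction can be sketched as a sum of shifted ReLUs that interpolates a target function linearly between chosen knots. The target `np.sin`, the knot count, and the helpers `relu_interpolant` and `evaluate` are assumptions made for illustration.

```python
import numpy as np

def relu_interpolant(target, knots):
    """Return (offset, biases, weights) such that
    g(x) = offset + sum_i weights[i] * max(0, x - biases[i])
    interpolates target linearly between the knots."""
    values = target(knots)
    slopes = np.diff(values) / np.diff(knots)   # slope on each segment
    # Each weight is the change in slope at its knot; the first starts from slope 0.
    weights = np.diff(slopes, prepend=0.0)
    return values[0], knots[:-1], weights

def evaluate(x, offset, biases, weights):
    # The coefficient of x inside each max is 1; only the shift and outer weight vary.
    hidden = np.maximum(0.0, x[:, None] - biases[None, :])
    return offset + hidden @ weights

knots = np.linspace(0.0, 2.0 * np.pi, 50)
offset, biases, weights = relu_interpolant(np.sin, knots)

xs = np.linspace(0.0, 2.0 * np.pi, 1000)
error = np.max(np.abs(evaluate(xs, offset, biases, weights) - np.sin(xs)))
print(error < 5e-3)
```

A weight equal to minus the current slope flattens the curve from that knot onward, which is the "opposite slope" rule above. Using more knots (more hidden units) reduces the error further.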

Convex Functions:

$f: X \subset \mathbb{R}^N \rightarrow \mathbb{R}$ is convex if for all $x_1, x_2 \in X$ and $t \in [0,1]$: $f(tx_1+(1-t)x_2) \leq tf(x_1)+(1-t)f(x_2)$

Convex functions are easy to optimize: any local minimum is also a global minimum.
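
The defining inequality can be probed numerically. The sketch below samples $t$ on a grid for one pair of points, so it can only find a violation of convexity, never prove convexity; the helper name `violates_convexity` and the test functions are illustrative.

```python
import numpy as np

def violates_convexity(f, x1, x2, num_t=101, tol=1e-12):
    """Return True if f(t*x1 + (1-t)*x2) > t*f(x1) + (1-t)*f(x2) for some sampled t."""
    t = np.linspace(0.0, 1.0, num_t)
    lhs = f(t * x1 + (1.0 - t) * x2)
    rhs = t * f(x1) + (1.0 - t) * f(x2)
    return bool(np.any(lhs > rhs + tol))

print(violates_convexity(np.square, -1.0, 2.0))   # x^2 is convex: no violation found
print(violates_convexity(np.sin, 0.0, np.pi))     # sin is not convex on [0, pi]
```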